DesignCon 2005

Hardware Implementation of a Tree Based IP Lookup Algorithm for OC-768 and beyond

Author

  • Florin Baboescu
Abstract

Continuing growth in link speeds and in the number of advertised IP prefixes places increasing demands on the performance of Internet routers. An Internet Service Provider (ISP) requires routers to accommodate up to 500,000 prefixes, and very likely 1 million in the near future. The prefix length is also expected to grow from (up to) 32 bits to 128 bits with the introduction of IPv6. While ternary-CAM (TCAM) based lookup solutions serve the needs of today's routers, they do not scale well to the next generation: TCAMs are expensive and consume a lot of power. Algorithmic solutions amenable to software implementation are scalable and cost-effective, but cannot deliver the throughput needed by OC-768 links. A pipelined hardware implementation of an algorithmic solution, however, can deliver high throughput, store a large number of wide entries, and remain cost-effective and scalable. This paper describes a single-chip pipelined architecture for tree-based searches that can support IP lookups at a rate of up to 250M searches/s on routing tables with up to 512K entries (either IPv4 or IPv6). It is used in conjunction with a multi-bit trie based algorithmic solution. The architecture relies on an embedded CPU to offload NPU table-management functions. On-chip firmware allows fast incremental prefix updates without affecting the search rate; the update rate is at least an order of magnitude better than that required by the IP protocols. Our solution has separate 36-bit QDRII interfaces to connect to the NPU and to an external SRAM (for next-hop data). The two high-speed interfaces, running at 250MHz, rely on source-synchronous DLL clocking to deliver 500Mbits/s/pin. Internally, a deeply pipelined search path delivers a search throughput of up to 250Msps. In addition, to meet the memory-access needs of the algorithm, an innovative, highly configurable embedded SRAM array was designed. It is able to perform up to 4 billion reads/s, with an internal bandwidth of 40GB/s. It also offers extremely high reliability, using ST's patented Robust SRAM process to reduce the soft-error rate. In conjunction with our multi-bit trie algorithm for IP lookups, it offers a capacity similar to that of an 18Mbit TCAM at a much lower cost and one-sixth the power.

The paper is organized as follows. It first introduces the IP lookup problem and presents a brief background on algorithmic solutions. It also contrasts algorithmic solutions with TCAM-based solutions, and then describes our solution, based on a compressed representation of a multi-bit trie search structure. The challenges of a hardware implementation of the algorithm (to meet OC-768 requirements) are then described, along with our solution and architecture. The role of the embedded CPU in memory allocation and memory management during pipelined tree-based searches is described, and the paper highlights our firmware solution for dynamic memory allocation and the design of an efficient, low-cost memory manager. We conclude with a description of the design, modeling and verification methodology, and share experiences that may be used in future complex SoC designs.

Author(s) Biography

Nick Richardson is a Fellow at STMicroelectronics and the manager of the Central R&D Advanced Designs Group in San Diego. He has an extensive background in the definition and design of CPU micro-architectures, cache controllers, and I/O subsystems, and holds over 15 US patents. He graduated from the University of London.

Lun bin Huang is a Sr. Principal Engineer at the STMicroelectronics Central R&D Advanced Designs Group in San Diego. He has over 15 years of experience in the areas of architecture, micro-architecture, modeling, design, and implementation of digital devices. He has a BSEE in Computer Science and Engineering and an MSEE in Communications from California State University, Long Beach.
Suresh Rajgopal is a Principal Engineer at the STMicroelectronics Central R&D Advanced Designs Group in San Diego. He has over 11 years of experience in the areas of design automation, low-power design, and SoC and CPU architecture, design and modeling. He has a Ph.D. in Computer Science and Engineering from the University of North Carolina at Chapel Hill.

Florin Baboescu has been with STMicroelectronics since the summer of 2002. His areas of expertise are computer networks, computer architecture and operating systems. In computer networks he has developed algorithms for IP lookup/classification, routing and scheduling, while in computer architecture he was one of the first to develop a memory system for a Simultaneous Multithreading (SMT) system. He has a PhD in Computer Science and Engineering from the University of California, San Diego.

Introduction

The rapid growth of the Internet has brought great challenges in deploying high-speed networks. One particular challenge is to provide high packet-forwarding rates through the router. Network search engines capable of providing IP lookup, VPN forwarding, or packet classification are a major component of every router. With the increase in link speeds, the increase in the number of advertised IP prefixes, and the deployment of new network services, the demands placed on these network search engines increasingly make them a potential bottleneck for the router. According to a recent survey from the Linley Group, the search-engine market grew a healthy 14% from 83 million USD in 2002 to 95 million USD in 2003, with the potential of exceeding 200 million USD in 2007. Most search-engine designs use either dedicated ASICs or ternary CAMs. Hardware solutions based on ternary CAMs (TCAMs) [1] have been the most popular implementation in routers, due to their efficiency. TCAMs are content-addressable memories in which each bit is allowed to store a 0, a 1, or a "don't care" value.
A TCAM compares each packet address with every address the search engine holds in its database, using parallel lookups on associative memory. However, TCAMs have their limitations: (1) a large cell size (about 16 transistors per bit), (2) high power consumption (an 18Mbit TCAM consumes about 15W at 133Msps), (3) very high cost ($300 for an 18Mbit TCAM), and (4) the inability to provide an efficient, scalable, single-chip solution for a search that requires storing more than 128,000 IPv6 prefixes. Algorithmic solutions are an attractive alternative for overcoming these limitations. They are scalable, low-power, cost-effective and amenable to software implementation. But they suffer from their own problems, namely long latencies, unpredictable capacity due to search sensitivity, and an inability to keep up with the ever-increasing lookup rates demanded by NPUs. Most algorithmic solutions for network searches can be regarded as some form of tree traversal, where the search starts at the root node, traverses various levels of the tree, and typically ends at a leaf node. For example, most Internet packets require a longest-matching-prefix search of a 32-bit destination address in a prefix table of several hundred thousand prefixes. The most common data structure for prefix lookups is some form of trie, a tree in which branching decisions are made based on the values of successive bits in the destination address. Using a single computational unit, which we call a processing block, even the fastest of these lookup schemes takes at least 8 memory accesses per lookup. As link speeds scale rapidly to OC-768, a packet lookup is required every 4ns. With memory speeds increasing very slowly in comparison, IP lookup will soon become a bottleneck for core routers. Fortunately, every tree-based lookup scheme can be pipelined using several processing blocks.
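Since the NSE's compressed multi-bit trie is not described here, the basic branching idea can be illustrated with a minimal uni-bit (one bit per level) trie sketch in Python. All names are illustrative, not the paper's implementation:

```python
# Minimal uni-bit trie for longest-prefix matching: each level consumes one
# bit of the destination address. The NSE uses a compressed multi-bit trie
# that consumes several bits per level; this shows only the branching idea.

class TrieNode:
    def __init__(self):
        self.children = {}    # maps a bit, '0' or '1', to a child node
        self.next_hop = None  # set when a stored prefix ends at this node

def insert(root, prefix, next_hop):
    """Store a prefix (a string of '0'/'1' bits; '' is the default route)."""
    node = root
    for bit in prefix:
        node = node.children.setdefault(bit, TrieNode())
    node.next_hop = next_hop

def longest_prefix_match(root, addr_bits):
    """Walk the trie along the address bits, remembering the deepest match."""
    node, best = root, root.next_hop
    for bit in addr_bits:
        node = node.children.get(bit)
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best
```

A multi-bit trie reduces the number of levels (and hence memory accesses) by branching on a stride of several bits at once, which is what makes a bound such as 8 memory accesses per lookup attainable.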
Part of the lookup for the first packet is done by the first block in the pipeline; the packet data is then passed to the second processing block while the first block works on the second packet. Thus several modern routers use algorithmic solutions that perform one lookup per memory access once the pipeline fills. In this way one can handle OC-768 using a pipelined forwarding engine. This also makes the solution amenable to hardware implementation. However, although the idea sounds simple, it has several limitations that need to be addressed. An important one is the allocation of memory to the pipeline stages. To allow the execution of one lookup in every pipeline cycle, the pipeline stages must not access the same memory space. Allocating separate on-chip memories to every pipeline stage is a natural way to provide zero memory contention. However, when the lookup chip is fabricated, one must already know how big the memory allocated to each pipeline stage should be. Compounding this is the challenge of fitting the solution into a single-chip implementation that can justify the low cost and low power of an algorithmic approach. Basu et al. [2] is one of several papers in this field that identify memory balance as a critical issue in the design of IP lookup engines. Their technique for reducing memory imbalance is to shape the tree structure so as to minimize the size of the largest per-stage memory. However, assuming a fixed-size trie and an 8-stage pipeline for IPv4, their results show that for different IP databases the memory in some stages still varies dramatically, both from stage to stage within the same application and between corresponding stages in different applications. Even with their new algorithm, the memory allocated to one stage varies from nearly 0 to 150Kbytes for various IP tables (of sizes between 100,000 and 130,000 prefixes).
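The timing claim above, one completed lookup per memory access once the pipeline is full, can be checked with a toy simulation. The 8-stage structure and all names here are illustrative assumptions, not the NSE design:

```python
# Toy model of a tree-lookup pipeline: each of 8 processing blocks performs
# one step of a lookup per cycle and hands the packet to the next block.
# The per-stage work (a memory access on that stage's tree level) is
# abstracted away; only the timing behavior is modeled.

STAGES = 8

def simulate(num_packets):
    pipeline = [None] * STAGES          # pipeline[i]: packet id in stage i
    completed, cycle, next_pkt = [], 0, 0
    while len(completed) < num_packets:
        if pipeline[-1] is not None:    # the last stage finishes a lookup
            completed.append((pipeline[-1], cycle))
        for i in range(STAGES - 1, 0, -1):
            pipeline[i] = pipeline[i - 1]   # advance each packet one stage
        pipeline[0] = next_pkt if next_pkt < num_packets else None
        next_pkt += 1
        cycle += 1
    return completed  # list of (packet id, completion cycle)

# After the 8-cycle fill, lookups complete back to back, one per cycle:
# simulate(4) -> [(0, 8), (1, 9), (2, 10), (3, 11)]
```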
The worst-case bound for a million prefixes is 11Mbytes per stage, or 88Mbytes across all eight stages. Sizing the ASIC for the worst-case memory bound would be extremely expensive; in some cases, the total allocated memory can be an order of magnitude larger than actually needed. To address this imbalance, one can either use complex dynamic memory-allocation schemes (which dramatically increase the hardware complexity) or over-provision each of the pipeline stages (which wastes memory). The use of large, poorly utilized memory modules also results in high memory latencies, which can have a detrimental effect on the speed of each stage of the pipelined computation, and thus on the throughput of the entire architecture. This is one example of the kind of complex architectural decisions we had to address in the design of a single-chip, high-speed IP lookup hardware solution. Our paper describes the NSE (Network Search Engine), a single-chip, high-performance, pipelined implementation of a multi-bit trie search algorithm. The algorithm itself (not discussed here) is an improvement in storage capacity over the one designed by Eatherton et al. [3]. The implementation may be used as a high-speed IP lookup co-processor whose primary application is accelerating IPv4 and IPv6 prefix lookups. It can support searches at a rate of up to 250Msps on up to 4096 routing tables, and can store up to 512K 32-bit Classless Inter-Domain Routing (CIDR) IPv4 prefixes. It also provides fast update operations (insert and delete) at a rate of 250K routes per second, using an embedded CPU core. The rest of the paper is organized as follows. Section 2 introduces the IP lookup problem and describes the compressed multi-bit trie algorithm used in the NSE. Section 3 describes the architecture and micro-architecture of the hardware implementation of the algorithm. Section 4 introduces the role of the firmware and the embedded CPU in the device.
We share some experimental results on routing-table capacity in Section 5. The design methodology, modeling tasks and flows are described in Section 6, followed by conclusions in Section 7.

1 IP Lookup Problem and Algorithm Description

Prefix  Value         Next Hop
P1      0000001*      NH1
P2      0000000000*   NH2
P3      01101100*     NH3
P4      0110110100*   NH4
P5      0110110101*   NH5
P6      11001*        NH6
P7      111101000*    NH7
P8      11110101*     NH8
P9      11110101110*  NH9
P10     01100*        NH10
P11     011011*       NH11
P12     *             NH12

Table 1: A simple example of a routing table with 12 prefixes

The IP lookup operation requires a longest-matching-prefix computation at wire speed. In the IPv4 domain, for example, the 32-bit IP destination address of each received packet is matched against a database of IP prefixes. Each entry consists of a prefix and a next-hop value. For a better understanding of the problem, consider a simple example based on the IP lookup database of 12 prefixes shown in Table 1. If the router receives a packet whose destination address starts with 11110101110, then the next-hop value associated with prefix P9 is selected.
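The example can be made concrete with a naive linear-scan longest-prefix match over the 12 prefixes of Table 1 (prefixes as bit strings, with P12's bare '*' represented as the empty string). This is only to illustrate the matching semantics; a real search engine uses a trie or TCAM, not a linear scan:

```python
# Table 1 as a Python dict: prefix bit string -> next hop.
table = {
    "0000001": "NH1",    "0000000000": "NH2",  "01101100": "NH3",
    "0110110100": "NH4", "0110110101": "NH5",  "11001": "NH6",
    "111101000": "NH7",  "11110101": "NH8",    "11110101110": "NH9",
    "01100": "NH10",     "011011": "NH11",     "": "NH12",  # P12 = '*'
}

def lookup(dest_bits):
    """Return the next hop of the longest prefix matching the destination."""
    best = max((p for p in table if dest_bits.startswith(p)), key=len)
    return table[best]

# A destination starting 11110101110 matches P12 (*), P8 (11110101*) and
# P9 (11110101110*); P9 is the longest, so NH9 is selected.
```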

Similar resources

A High Performance Parallel IP Lookup Technique Using Distributed Memory Organization and ISCB-Tree Data Structure

The IP Lookup Process is a key bottleneck in routing due to the increase in routing-table size, increasing traffic, and the migration to IPv6 addresses. The IP address lookup involves computation of the Longest Prefix Matching (LPM), for which existing solutions, such as BSD Radix Tries, scale poorly when traffic in the router increases or when employed for IPv6 address lookups. In this paper, we describe a ...


Towards More Power Efficient IP Lookup Engines

Towards More Power Efficient IP Lookup Engines. (December 2005) Seraj Ahmad, Bachelor of Technology, Indian Institute of Technology, Guwahati, India Chair of Advisory Committee: Dr. Rabi N. Mahapatra The IP lookup in internet routers requires implementation of the longest prefix match algorithm. The software or hardware implementations of routing trie based approaches require several memory acc...


High-speed IP routing with binary decision diagrams based hardware address lookup engine

With a rapid increase in the data transmission link rates and an immense continuous growth in the Internet traffic, the demand for routers that perform Internet protocol packet forwarding at high speed and throughput is ever increasing. The key issue in the router performance is the IP address lookup mechanism based on the longest prefix matching scheme. Earlier work on fast Internet Protocol V...


IP Lookup Using the Novel Idea of Scalar Prefix Search with Fast Table Updates

Recently, we have proposed a new prefix lookup algorithm which would use the prefixes as scalar numbers. This algorithm could be applied to different tree structures such as Binary Search Tree and some other balanced trees like RB-tree, AVL-tree and B-tree with minor modifications in the search, insert and/or delete procedures to make them capable of finding the prefixes of an incoming string e...




Publication date: 2005